ABSTRACT

Microbial genome sequencing is one of the longest- 
standing areas of biological database development, 
but high-throughput, low-cost technologies 
have increased its throughput to an unprecedented 
number of new genomes per year. Several thousand 
microbial genomes are now available, necessitating 
new approaches to organizing information on gene 
function, phylogeny and microbial taxonomy to 
facilitate downstream biological interpretation. 
MetaRef, available at http://metaref.org, is a novel 
online resource systematically cataloguing a 
comprehensive pan-genome of all microbial clades 
with sequenced isolates. It organizes currently 
available draft and finished bacterial and archaeal 
genomes into quality-controlled clades, reports all 
core and pan gene families at multiple levels in the 
resulting taxonomy, and it annotates families' 
conservation, phylogeny and consensus functional 
information. MetaRef also provides a comprehen- 
sive non-redundant reference gene catalogue for 
metagenomic studies, including the abundance 
and prevalence of all gene families in the >700 
shotgun metagenomic samples of the Human 
Microbiome Project. This constitutes a systematic 
mapping of clade-specific microbial functions 
within the healthy human microbiome across 
multiple body sites and can be used as reference 
for identifying potential functional biomarkers in 
disease-associated microbiomes. MetaRef provides
all information both as an online browsable 
resource and as downloadable sequences and 
tabular data files that can be used for subsequent 
offline studies. 



INTRODUCTION 

High-throughput sequencing has become an invaluable 
tool for scientific investigation, and computational 
resources are still adapting to the ubiquity and scale of 
modern genomic resources. This has been particularly 
true for microbial sequencing, where new methods must 
be developed to organize tens of thousands of genomes, 
hundreds of isolates per species and a pan-genome that 
comprises millions of gene families. Many databases 
catalogue microbial sequences, but few yet place these 
sequences within their phylogenetic and functional 
context. 

For example, many microbial clades have now been 
shown to have surprisingly large pan-genomes in 
contrast to the size of their core genomes (1). Any one 
Escherichia coli isolate typically contains ~4700 genes, 
but only some 2000 of these are found in all E. coli, and 
the pan-genome selected from all strains in the species now 
greatly exceeds 8000 genes (2). Yet current microbial 
genome resources rarely indicate the core or pan- 
genomes of a clade of interest, nor do they conversely 
assess the phylogenetic distribution of individual gene 
families. 

No comprehensive tool is thus available for interrogation
of microbial clade-specific gene sequences, families or
functional annotations. Here, we describe the algorithms,
database and online interface for such a framework, initially
detailing over 10 million genes from ~2800 microbial
genomes. All information is both browsable
interactively through an online web interface and down- 
loadable for offline analysis, and the underlying data are 
automatically updated every 6 months. 

OVERVIEW OF THE METAREF DATABASE 

MetaRef is a novel online resource reporting a systematic
sequence-based catalogue of the diversity and
characteristics of the gene repertoires of all microbial clades
with available whole-genome sequence reference informa- 
tion. It identifies and reports each gene family present in at 
least one genome, organizing them into pan, core or marker 
families, reconciles their functional annotations when 
possible, and systematically surveys their presence in 
multiple body sites of the human microbiome. 

MetaRef serves as a comprehensive and 
convenient resource for tasks in microbial genome inves- 
tigations, comparative genomics, and quantitative 
microbial ecology and offers several features not 
provided by existing systems. The MetaRef microbial 
gene family system (defined at 80% full-length nucleotide 
identity) includes every annotated reading frame in cur- 
rently sequenced genomes, including singleton and 
uncharacterized genes. The underlying gene family cluster- 
ing algorithm is defined hierarchically so as to scale effi- 
ciently to the increasing number of sequenced genomes 
(currently >2800); many existing catalogues such as 
COG (3) and KEGG KOs (4) neither match this scale 
nor typically annotate more than a fraction of genes in 
any one genome. Unsupervised approaches such as 
KEGG OC (5), PHOG (6) and OMA (7) face scalability
issues and are often only appropriate for finished 
genomes. The largest current databases, such as MBGD 
(8), eggNOG (9) and UniRef (10), extend these systems to 
all proteins in a larger number of input genomes, but uniformly
at the cost of a loss of transferable information:
all provide comprehensive definitions of microbial gene
families, but none link these families to phylogenetic and
functional information for downstream interpretation.

The novel resource provides information useful for at 
least two different classes of investigator. For microbiolo- 
gists focusing on individual microbial clades (e.g. a genus 
or a species), MetaRef provides a pre-computed pan- 
genome comprising core families consistently present in 
the clade, marker families present uniquely in the clade, 
and pan families present only in some clade members. This 
constitutes a tree-of-life-wide resource for phylogenetic
analysis (e.g. by aligning and concatenating MetaRef
core families) with several additional features unique to
the MetaRef system including consensus-based functional 
characterization and pre-computed conservation/diversity 
scores. Second, for investigators focusing on large-scale 
comparative genomics or ecology, MetaRef also 
provides a catalogue of pre-identified marker genes for 
microbiome taxonomic profiling (11), and the means to 
assess the relevance of any microbial gene families with 
respect to their symbiotic healthy relationship with the 
human body, as the system reports the abundance and 
prevalence of each family in each clade in six body sites 
of the healthy human microbiome integrating shotgun 
metagenomic data produced by the Human Microbiome 
Project (HMP) (12). No existing resources for microbial 
genomics provide both this level of clade-specific detail 
and whole-microbiome analysis. 

MetaRef thus provides a new resource based on auto- 
matic and unsupervised complete gene clustering, process- 
ing a highly scalable number of sequenced microbial 
genomes. Both final and draft genomes are included in 
this process; open reading frame (ORF) calls are the
only necessary annotations. The MetaRef web interface
has been developed to provide a variety of clade-centric 
and gene family-centric interactive views to access and 
explore the sequenced microbial diversity. All MetaRef 
information, including sequence data, annotations and 
metagenomic mapping, is also available for download in 
plain text format for offline processing. While MetaRef 
provides novel aspects such as clade-specificity for the 
gene family system and metagenomic surveys of each
gene family considered, it also combines four features currently
not present simultaneously in other resources: (i) it
comprehensively processes all available microbial genomic
data; (ii) it provides both online and downloadable
sequence information; (iii) it efficiently scales to the 
increasing throughput of reference genomes; and (iv) it 
leverages functional and phylogenetic information to 
gain insights into each gene family. 

DATABASE GENERATION, CONTENT AND 
DEFINITIONS 

The MetaRef database and online resource are built upon 
publicly available microbial genomic and metagenomic 
information automatically acquired and processed on a 
regular basis (Figure 1). The computational pipelines we
implemented process the ~3000 microbial genomes 
available (from IMG (13)) with the associated taxonomic 
and gene information and the ~3.5TB of shotgun 
metagenomic data of 757 human microbiome samples 
(from the HMP (14)). These pipelines include PhyloPhlAn 
(15) for taxonomic curation and a novel clustering pipeline
for pan gene classification (11,16) (see Supplementary 
Methods). The second phase of computational processing 
comprises metagenome mapping using a Bowtie-based 
strategy (17), a novel taxonomic and phylogenetic tree 
visualization (GraPhlAn, publicly available at http:// 
huttenhower.sph.harvard.edu/graphlan), and the deriv- 
ation of consensus functional annotations. The MetaRef 
online interface is a dynamic website with a back-end
MySQL relational database, developed using Django
(version 1.4.3, https://www.djangoproject.com; see
Supplementary Methods for implementation details).

The MetaRef system: definition of marker, core, and pan
genes and families 

The phylogenetic diversity within each microbial taxonomic
clade is captured by cataloguing the overall repertoire
of distinct ORFs contained in the pan-genome
(18,19). MetaRef naming conventions refer to individual
microbial ORF sequences as genes, and groups of genes
related by some homology criterion as families. Pan genes
are defined as ORFs present at least once in at least one
genome of a clade of interest. Pan genes are usually
grouped into homologous groups, called pan families,
which capture a combination of paralogy, orthology and
horizontal gene transfer relationships through sequence
clustering requiring a full gene length similarity threshold
(20). Among the pan genes forming a pan family, the ORF
minimizing the dissimilarity with respect to all the other
pan genes is called the pan centroid.
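
As an illustration of the centroid definition above, the following minimal Python sketch picks the member of a family with the smallest summed dissimilarity to all other members (a medoid). The function names and the toy position-wise dissimilarity are assumptions made for this example only; the actual dissimilarity measure used by the MetaRef clustering pipeline is described in the Supplementary Methods and is not reproduced here.

```python
from typing import Callable, Sequence


def toy_dissimilarity(a: str, b: str) -> float:
    """Hypothetical stand-in measure: fraction of mismatched positions
    after padding the shorter sequence with gap characters."""
    length = max(len(a), len(b))
    if length == 0:
        return 0.0
    padded_a, padded_b = a.ljust(length, "-"), b.ljust(length, "-")
    return sum(x != y for x, y in zip(padded_a, padded_b)) / length


def pan_centroid(
    genes: Sequence[str],
    dissimilarity: Callable[[str, str], float] = toy_dissimilarity,
) -> str:
    """Return the gene minimizing total dissimilarity to all family members."""
    if not genes:
        raise ValueError("a family must contain at least one gene")
    return min(genes, key=lambda g: sum(dissimilarity(g, o) for o in genes))


# Toy family of four short sequences (illustrative data, not from MetaRef).
family = ["ATGGCA", "ATGGCT", "ATGGCA", "TTGACA"]
centroid = pan_centroid(family)
```

The exhaustive pairwise comparison is quadratic in family size, which is acceptable for a sketch but would presumably need a faster heuristic for families with thousands of members.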
Figure 1. MetaRef comprises automated processing of available microbial genomic information, downloadable phylogenetic and functional annotations,
and the MetaRef web portal. The implemented computational pipeline first processes the available microbial reference genomes with the
associated taxonomic metadata (currently from IMG) to produce the comprehensive MetaRef sequence database of all gene families. The database is
then integrated with the available functional annotations and shotgun metagenomic data (currently from the HMP project) to provide clade-centric
and family-centric views of the current microbial diversity.



Some pan families are conserved within all organisms of 
a clade and are thus defined as core families (Figure 2).
When a core family for a clade (say the Staphylococcus 
aureus species) is not conserved in its direct ancestor clade 
(in this case the genus Staphylococcus), it is of biological 
interest as a potentially conserved function distinguishing 
the clade from sibling clades (e.g. Staphylococcus
epidermidis and other Staphylococcus spp.) and is called 
a crown family (Figure 2). A crown family can still be 
simultaneously present in unrelated branches of the 
taxonomy, possibly as a consequence of evolutionarily 
successful horizontal gene transfer events or complex 
gene loss patterns. However, when a crown family 
appears to be uniquely present in a given clade, we 
define it as a marker family (11), containing genes that 
represent specific sequences and, possibly, specific func- 
tions universally distinguishing the clade within the full 
microbial phylogeny (Figure 2). 
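
The core, crown and marker definitions can be expressed as set operations on genome membership. The sketch below is a simplified reading of those definitions, not the MetaRef implementation: it assumes strict presence in every genome of a clade (the thresholds actually used are given in the Supplementary Methods), omits singleton handling, and uses a hypothetical function name and toy genome identifiers.

```python
from typing import Optional, Set


def classify_family(
    family_genomes: Set[str],
    clade_genomes: Set[str],
    parent_genomes: Optional[Set[str]],
) -> str:
    """Classify a gene family relative to one clade.

    family_genomes: genomes (anywhere in the taxonomy) carrying the family
    clade_genomes:  genomes belonging to the clade of interest
    parent_genomes: genomes of the direct ancestor clade, or None at the root
    """
    if not family_genomes & clade_genomes:
        return "absent"
    if not clade_genomes <= family_genomes:
        return "pan"  # present in some, but not all, clade members
    # From here on the family is core for the clade.
    if parent_genomes is not None and parent_genomes <= family_genomes:
        return "core"  # also conserved in the direct ancestor clade
    if family_genomes <= clade_genomes:
        return "marker"  # crown family never seen outside the clade
    return "crown"  # core here, not in the parent, but present elsewhere


# Toy taxonomy: two S. aureus genomes inside a three-genome Staphylococcus genus.
s_aureus = {"sa1", "sa2"}
staphylococcus = {"sa1", "sa2", "se1"}
labels = [
    classify_family({"sa1", "sa2"}, s_aureus, staphylococcus),
    classify_family({"sa1", "sa2", "ec1"}, s_aureus, staphylococcus),
    classify_family({"sa1", "sa2", "se1"}, s_aureus, staphylococcus),
    classify_family({"sa1"}, s_aureus, staphylococcus),
]
```

Note that in this reading every marker family is also a crown family, and every crown family is also a core family; the function simply returns the most specific label.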



MetaRef families consist of the union of all pan, core, 
crown and marker families that determine the full micro- 
bial gene repertoire. MetaRef families can in the largest 
cases contain hundreds or thousands of genes, in which 
case it is convenient to use a unique representative 
sequence for the whole family, referred to as the family 
centroid (as for the pan families introduced above).
Singleton families consisting of only one gene cannot fall
into any of these categories and are excluded from core,
crown and marker family definitions. Overall, MetaRef
comprises ~5 M families, of which ~3.6 M are core
families and ~1 M are markers, as reported in Table 1
together with the statistics of some representative clades.

Organization of the MetaRef interactive web resource 

The MetaRef website provides two basic routes for 
data browsing: clade-centric and gene family-centric. 
The clade-centric view presents data based on the 
Figure 2. Core and marker family definitions. MetaRef defines three types of families made up of individual genes, represented here on a cartoon
taxonomic tree in which leaf nodes represent strains/genomes. Core families contain genes consistently present within a clade, crown families are
cores for which the clade is the lowest common ancestor, marker families are the subset of core families uniquely present only in a specific clade, and
pan families comprise all genes found at least once in a sequenced microbial genome. The two trees exemplify these definitions for a gene G present
in the genomes marked with a small black triangle. Specifically, on the left panel, G is a crown gene for clade Y as it is present in all its leaves and it
is not a core gene in any ancestor of Y. The presence of G outside clade Y does not affect the definition of G as a crown gene, and for clades Y1, Y2
and Y3, gene G is a core gene. On the right panel, gene G is instead a marker gene for clade Y as G is never present in genomes outside clade Y.



Table 1. MetaRef family statistics for representative phyla, genera and species

Clade              Species  Genomes  Genes       MetaRef families  Tot. pan families  Tot. core families  Tot. marker families
All microbes       1184     2818     10 880 874  5 006 295         376 947            3 600 814           1 028 534
Proteobacteria     463      1162     4 560 201   1 802 529         408 957            1 078 082           315 490
Bacteroidetes      88       124      474 420     294 243           236 924            48 040              9279
Actinobacteria     148      272      999 822     582 201           38 028             435 935             108 238
Firmicutes         302      838      2 600 128   890 666           211 005            452 615             227 046
Staphylococcus     11       92       248 936     26 206            16 204             9125                877
Streptococcus      32       132      277 611     52 673            27 072             23 171              2430
Bacteroides        22       45       206 420     70 775            45 843             22 774              2158
Strep. pneumoniae  -        42       93 431      4341              2733               1579                29
Staph. aureus      -        72       198 655     5301              3137               1973                191

Singleton families are not reported.



hierarchically structured taxonomy clades. The gene 
family-centric view provides detailed biological information
on individual MetaRef gene families. We recognize
the intrinsic limitation of online low-throughput data
navigation; therefore, throughout the site, the underlying 
data reported on screen can always be downloaded in 
plain text and FASTA format for further offline analysis. 

The clade-centric view is organized according to the 
hierarchical (parent-child) relationships between clades 
and is rooted at two domain levels (Bacteria and 
Archaea) for which MetaRef currently catalogues 
2706 and 112 genomes, respectively. The clade-centric 
webpage for a given query clade (example in 
Supplementary Figure S1) indicates the full taxonomic
lineage of that clade and links to pages describing all
related clades. When the query clade is at the family, 
genus or species level, a summary table of the core/pan 
genes of all sequenced genomes in the clade is reported. 
Links provided from this table connect to the gene family- 
centric view for further gene-specific inquiry. 



To indicate how much sequencing data are currently 
available in the query clade, the total number of 
genomes in the clade and its direct child clades are 
reported and hyperlinked to genome pages. A circular
genome tree of the query clade is provided to help visual- 
ize the number and distribution of genomes (the leaf 
nodes) sequenced in the clade and its direct child clades. 
When the query clade is itself a genome, more information
is reported, including a table that summarizes how many
genes in that genome are core at each ancestor clade
and how many genes are unique markers of the
genome. Additionally, a download link gives access to a
detailed tab-delimited table describing all genes in the
genome. A link to the original external source of the genome is
also provided for further inquiry.

The family-centric view (Figure 3) focuses on biological 
information for each individual MetaRef gene family. 
Information reported on the gene family page includes 
functional annotations aggregated from individual gene 
members and external functional databases; summary 
Figure 3. Summary of the main MetaRef panels reported in the family-centric view. For each gene family, MetaRef reports taxonomic, phylogenetic
and conservation information as well as available functional information, additional consensus annotations, and the family's prevalence and abundance
in human-associated microbiomes. The information can further be analysed interactively (by exploring sub-branches of the microbial phylogeny)
and offline (by downloading sequence FASTA files and gene annotation tables for the family).

[Figure 3 panels, shown for an S. aureus core gene family: taxonomic and conservation information (present in 71/72 S. aureus genomes, score 0.9995);
consensus annotations tallied from family members, as a gene description table (top five descriptions with counts, including four members previously
described only as hypothetical proteins) and a functional database table (KEGG KO:K00425, COG1271 and EC:1.10.3.-, 71 members each);
HMP metagenomic data, as a bar plot of prevalence (%) and a box plot of log relative abundance (%) across the major body sites;
and a zoomable genome tree highlighting the genomes carrying the gene, including homologs outside the S. aureus clade.]



plots of the prevalence and abundance of the gene as 
found at the major HMP body sites; and reports on 
homologs of the query gene family found outside of its 
clade. Access to taxonomic clades and MetaRef families is 
also possible through keyword search (see Supplementary 
Methods). 

Consensus functional annotation of MetaRef families 

MetaRef gene families were defined solely using sequence 
homology with stringent criteria (80% full-length identity,
see Supplementary Methods). The genes clustered into a
family were therefore likely to be genes that would carry
out the same biological functions. However, we observed
that functional annotations of individual members within a
family were, at times, inconsistent, often stemming from
the fact that individual genomes were annotated by differ- 
ent methods at different times, with varying criteria and 
terminology. Consolidating annotation information from 
individual gene members within a family helps improve 
consistency of annotation interpretation across genes 
and provides putative annotation corrections in many 
cases. Additionally, MetaRef directly links a gene family
with corresponding taxonomic clades, which is crucial
when focusing on the property of a function encoded by
a gene restricted to a specific group of microbes sharing a
common ancestor (e.g. studying a given gene family in
E. coli rather than in all microbes possessing the gene).

The current MetaRef implementation provides a simple 
and conservative effort towards consensus annotation 
by tallying identical annotations from all available anno- 
tation schemes, including free-text deflines, KO, COG and 
EC assignments. Each family's resulting table is dynamic- 
ally and automatically generated from the consensus of its 
gene members. By reviewing the tally table, functions of 
some unknown/hypothetical protein genes have become 
clearer and recognizable. For example, a 71 gene core 
family of S. aureus (reported in Supplementary Figure 
S2) is consistently annotated as a putative siderophore 
biosynthesis protein subunit, although eight individual 
gene members were previously only annotated as hypo- 
thetical proteins. In another case (Supplementary Figure 
S2), two DNA mismatch repair proteins in a core family 
are labelled as HexA, whereas all the other members of the
family (70) are labelled as MutS, suggesting a
misannotation. This potential inconsistency has been confirmed
by ad hoc sequence-based analysis, which showed
that both of the former two proteins possess the MutS_1,
MutS_2 and MutS_3 superfamilies and the ABC_MutS1
domains and that they differ from several consistently
annotated MutS proteins by only one amino acid along the
full length of 872 amino acids. These examples confirm the
usefulness of MetaRef efforts towards large-scale
sequence-based consensus annotation.
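
The tallying step described above can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the MetaRef code: the function names are hypothetical, annotations are matched as exact strings (as the text describes), and the 5% cut-off used to flag minority labels is an arbitrary choice made for this example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def tally_annotations(
    member_annotations: Iterable[str], top_n: int = 5
) -> List[Tuple[str, int]]:
    """Count identical annotation strings across the members of one family."""
    counts = Counter(a.strip() for a in member_annotations if a and a.strip())
    return counts.most_common(top_n)


def minority_labels(
    tally: List[Tuple[str, int]], max_fraction: float = 0.05
) -> List[str]:
    """Flag labels carried by at most `max_fraction` of annotated members
    (assumed threshold), as candidates for manual review."""
    total = sum(count for _, count in tally)
    return [label for label, count in tally if count / total <= max_fraction]


# Counts taken from the MutS/HexA example in the text: 70 members vs. 2.
annotations = ["MutS"] * 70 + ["HexA"] * 2
tally = tally_annotations(annotations)
suspects = minority_labels(tally)
```

A flagged label is only a hint; as in the MutS/HexA case, sequence-level analysis is still needed to decide whether the minority annotation is actually wrong.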

MetaRef characterization of metagenomic samples 

Comprehensive studies of human-associated microbial 
communities are still in their first generation and have 
yet to be well integrated with microbial isolate sequence
resources such as the Human Microbiome Project's 
(HMP) Data Analysis and Coordination Center (21) 
(http://hmpdacc.org). MetaRef includes what we believe 
to be the first comprehensive gene family-centric reference 
generated from a large cohort of healthy human subjects: 
using our pre-computed database of family centroids 
along with the metagenomes sequenced by the HMP, we 
have compiled baseline data characterizing the relative 
abundance and prevalence of all gene families across all
major body sites in healthy individuals. Investigators 
studying site-specific pathologies, for example, can use 
these data to efficiently characterize differences in functional
potential (at the individual gene level) between
healthy- and disease-associated microbial communities 
(22) by looking for significant differences in observed 
prevalence/abundance of detected genes (23). It has been 
shown (11) that marker centroids can be reliably used to
estimate the presence and relative abundance of their
associated clades. As a result, a single alignment pass
of a collection of pathology-associated metagenomic
sequence reads against the full collection of MetaRef centroids,
once compared to analogous data
compiled from a cohort of healthy subjects at the same
body site, can be interrogated to detect both taxonomic
and functional differences between health- and disease-associated
microbiota.

To construct our baseline reference data, we first used 
Bowtie (17) to align all sequence reads from 757 shotgun
metagenomic HMP samples (covering all five studied
body areas: airways, GI tract, mouth, skin and urogenital
tract, including their various subsites) against 5 006 294
family centroid sequences (Figure 4). For each sample, we
counted all matches to each centroid, then converted these 
raw counts to RPKMs (24) to normalize with respect both 
to differences in inter-sample sequencing depth and differ- 
ences in individual centroid lengths. Using these 
normalized counts, we calculated, for each sample, the 
relative abundance of each centroid as a percentage of 
all normalized counts. For all body sites, we computed
the mean relative abundance (over all non-zero samples) 
of each centroid detected at that site, to generate a single 
statistic estimating the typical relative abundance of each 
centroid at that site. The presence or absence of each 
centroid at each site was also used to compute prevalence 
statistics: for each centroid and each body site in which it 
was found, we report the percentage of samples in which
that centroid was detected. Body-site-based relative abundance
and prevalence statistics are summarized (as box
plots and bar plots, respectively) on individual centroid 
pages on the MetaRef website (cf. Figure 3). Although 
for completeness of the downloadable resource we 
aligned all 757 HMP samples against the centroid collection,
the prevalence and abundance data provided by
MetaRef reflect the subset of 630 samples that passed
HMP's quality-control procedures for the six body sites
considered. Please see the Supplementary Methods for a
discussion of our alignment strategy, precise details of the
alignment operations, and a summary of the results of
validation tests we conducted to ensure that our results 
were sufficiently representative of the sampled microbial 
communities. 
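
The normalization and aggregation steps described in this paragraph can be sketched as follows. This is a minimal illustration, not the MetaRef pipeline: the function names and toy numbers are hypothetical, and the RPKM denominator is assumed to be the total number of reads of the sample (the exact read total used is not stated in the main text).

```python
from typing import Dict, List, Tuple


def rpkm(read_count: int, centroid_length_bp: int, total_reads: int) -> float:
    """Reads per kilobase of centroid per million reads in the sample."""
    return read_count * 1e9 / (centroid_length_bp * total_reads)


def relative_abundance(
    counts: Dict[str, int], lengths: Dict[str, int], total_reads: int
) -> Dict[str, float]:
    """Per-sample abundance of each centroid as a percentage of all RPKMs."""
    values = {c: rpkm(n, lengths[c], total_reads) for c, n in counts.items()}
    total = sum(values.values())
    if total == 0:
        return {c: 0.0 for c in values}
    return {c: 100.0 * v / total for c, v in values.items()}


def site_statistics(
    samples: List[Tuple[Dict[str, int], int]], lengths: Dict[str, int]
) -> Dict[str, Tuple[float, float]]:
    """For one body site, map each detected centroid to
    (mean relative abundance over non-zero samples, prevalence in %)."""
    detected: Dict[str, List[float]] = {}
    for counts, total_reads in samples:
        for c, pct in relative_abundance(counts, lengths, total_reads).items():
            if pct > 0:
                detected.setdefault(c, []).append(pct)
    n_samples = len(samples)
    return {
        c: (sum(v) / len(v), 100.0 * len(v) / n_samples)
        for c, v in detected.items()
    }


# Two toy samples from one body site: (read counts per centroid, total reads).
lengths = {"A": 1000, "B": 2000}
samples = [({"A": 10, "B": 20}, 1_000_000), ({"A": 5, "B": 0}, 500_000)]
stats = site_statistics(samples, lengths)
```

In the toy data, centroid B receives twice as many reads as A in the first sample but is twice as long, so both end up with the same RPKM; this is the length correction the normalization is meant to provide. Because relative abundance is a within-sample ratio, the per-sample read total cancels out at that step.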

The number of centroids detected in each sample that
passed QC ranged from 2563 (from a sample taken from
the posterior vaginal fornix: when healthy, such samples
exhibit the lowest-diversity microbial communities of all
studied body sites (25)) to 4 128 821 (from a distal gut
sample, the most complex of all human-associated microbial
communities (25)), with a median of 94 135 (from an
oral sample). As examples, one core gene (encoding a
sigma factor subunit of RNA polymerase) for
Bacteroides ovatus, a bacterium commonly found in the
human gut (26,27), was observed in 81.53% of all GI-tract
samples, and in <5% of samples sequenced from
all other body sites. Its average relative abundance
within gut samples was more than two orders of magnitude
higher than its relative abundance in all other body
sites. Similarly, a core gene (encoding a cell division
protein) for S. epidermidis, a common commensal
colonist of the human nasal passage (28,29), was seen
in ~25% of all nasal samples, but <3% of samples from
other body sites; its average relative abundance in nasal
samples was again at least two orders of magnitude
greater than in samples from any other body site.
Comprehensive prevalence/abundance statistics for all
family centroids (including markers) have been made
available for direct download from the MetaRef website,
and graphical reports for individual centroids can be
browsed directly (Supplementary Figure S3). These can
be conveniently used as a robust collection of healthy
baseline data for any research involving the human
microbiome, whether the focus be on taxonomic composition,
functional potential or both.



DISCUSSION 

With MetaRef.org, we provide the most comprehensive 
and up-to-date database of a non-redundant reference 
gene catalogue that not only delivers information on 
gene family conservation, but also captures phylogeny 
and consensus functional annotation. It is easily accessible
as an online browsable resource allowing quick functional
and phylogenetic explorations of gene families, and
its contents are also available by direct download for 
large-scale bioinformatics applications. MetaRef is 
committed to providing a comprehensive microbial gene 
family resource, and thus includes all systematically 
Figure 4. Overview of MetaRef's statistical processing of microbial gene families in the healthy human body-site-specific microbiome. Raw sequence
samples from HMP metagenomes are aligned against all MetaRef family centroids; the number of reads aligning to each centroid in each sample is
tallied; read counts are transformed via RPKM to normalize with respect to centroid length and sampling depth; centroid RPKMs for all samples
associated with each body site are aggregated into measures of prevalence and relative abundance for all centroids detected at that site.



available finished and draft genomes from the time of 
creation, automatic detection of taxonomic inconsistencies 
and suggestions of plausible corrections, and is designed 
with regular automated updates at increasing scale 
in mind. We anticipate future versions including add- 
itional sources of microbial genomes and expanding 
rapidly as isolate sequencing becomes increasingly 
ubiquitous. 

With microbial sequence data in all its diversity being
strongly on the rise (30-32), a new approach such as 
MetaRef is needed to simultaneously organize informa- 
tion on gene function, phylogeny and microbial 
taxonomy throughout the tree of life to facilitate downstream
biological interpretation. Going forward, we also
plan to increase the use of and connectivity to other 
genome databases, particularly those providing extended 
annotation schemes in order to continue the harmoniza- 
tion process for annotations. We will also automatically 
incorporate additional phylogenetic lineages currently 
being sequenced (32,33) as well as additional meta'omic
datasets (34-37), and based on current performance we 
expect the system to scale smoothly in anticipation of 
the rapid increase of available genomes. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online,
including [38-40].